day 10 - ChristmaSSE KeyGen

I ran this program but it never finished... maybe my computer is too slow. Maybe yours is faster?

https://advent2019.s3.amazonaws.com/326c15f8884fcc13d18a60e2fb933b0e35060efa8a44214e06d589e4e235fe34

Recon

We're given a binary that takes no input and does a very long-running calculation.

With a bit of static analysis, we notice that it's a nested loop with the inner loop doing 1000 iterations for every iteration of the outer loop. The outer loop is supposed to do 0x112210f47de98115 (1234567890123456789 in decimal) iterations. Clearly, this will never be finished in time for the CTF. To put it into perspective, if every outer loop iteration takes 1 nanosecond, it would take ~39 years to finish the computation.

Obviously, we need to figure out what it's trying to do to optimize it. After the outer loop is done, it performs an XOR operation with a block of data given the symbol "flag" and outputs that. The necessary components are the SSE registers xmm3, xmm2, xmm1 and xmm0.

We resorted to a bit of dynamic analysis at first to see if we can find a pattern, we created a binary with a different iteration count and created this gdb script to trace SSE registers state (gdb -x bla.gdb ./slow.elf)

define p128
    printf "%016lx%016lx", $arg0.v4_int64[1], $arg0.v4_int64[0]
end

br *0x4005a3
commands
    printf "xmm0: "
    p128 $ymm0
    printf " -- xmm1: "
    p128 $ymm1
    printf "\n"
    printf "xmm2: "
    p128 $ymm2
    printf " -- xmm3: "
    p128 $ymm3
    printf "\n"

    c
end

xmm0: ****1112****1516****0304****0708 -- xmm1: ****2728****3132****1920****2324
xmm2: ****4344****4748****3536****3940 -- xmm3: ****5960****6364****5152****5556
0x601090 <flag>:    0x0000000000000000  0x0000000000000000
0x6010a0 <flag+16>: 0x0000000000000000  0x0000000000000000
(gdb) c
Continuing.

Breakpoint 2, 0x000000000040077f in main ()
PRINT FLAG
0x601090 <flag>:    0x5960636451525556  0x4344474835363940
0x6010a0 <flag+16>: 0x2728313219202324  0x1112151603040708

Unfortunately, we couldn't identify a pattern quickly by doing that, so we resorted to reverse engineering the application. The structure of the application is as shown below in pseudocode.

Outer loop

Constants

d01 = 0x00000001000000010000000100000001
d02 = 0x00000002000000020000000200000002
d03 = 0x00000003000000030000000300000003
d04 = 0x00000004000000040000000400000004

d11 = 0x00000005000000050000000500000005
d12 = 0x00000006000000060000000600000006
d13 = 0x00000007000000070000000700000007
d14 = 0x00000008000000080000000800000008

d21 = 0x00000009000000090000000900000009
d22 = 0x0000000a0000000a0000000a0000000a
d23 = 0x0000000b0000000b0000000b0000000b
d24 = 0x0000000c0000000c0000000c0000000c

d31 = 0x0000000d0000000d0000000d0000000d
d32 = 0x0000000e0000000e0000000e0000000e
d33 = 0x0000000f0000000f0000000f0000000f
d34 = 0x00000010000000100000001000000010

Loop structure

A register suffixed with b means the value of the register at the beginning of the loop. rep x data[n] is repeat the \(x\)th 32-bit dword, where \(4 \leq x \leq 1\), of the 128-bit data block where \(0 \leq n \leq 5\), and data[0] is at address 0x00601040. The index for the dwords for the repetition is 1-based, where 4 is the highest 32-bit dword and 1 is the lowest (little endian).

xmm3 = d04 * xmm0b + d03 * xmm1b + d02 * xmm2b + d01 * xmm3b
xmm2 = d14 * xmm0b + d13 * xmm1b + d12 * xmm2b + d11 * xmm3b
xmm1 = d24 * xmm0b + d23 * xmm1b + d22 * xmm2b + d21 * xmm3b
xmm0 = d34 * xmm0b + d33 * xmm1b + d32 * xmm2b + d31 * xmm3b

Inner loop

for i in range(1000):
  cmp = 0x0096433d
  for reg in [xmm3, xmm2, xmm1, xmm0]:
    for j, dword in reg:
        if cmp <= dword:
          reg[j] = reg[j] - 1

Reverse engineered python implementation

After quite a bit of clean up, we realized that it's a simple matrix exponentiation of an initial matrix modulo 0x0096433d. Here's the reverse engineered version of the binary, simplifying all the operations down to the basics.

#!/usr/bin/python

import struct
import numpy as np

flag = "fc14eb09bcaee7474fe37cc152a5028e8971c88d9623016d71405aeafd461d23".decode('hex')

n = 0

xmm0 = np.array([0, 0, 0, 1],  dtype=np.int32)
xmm1 = np.array([0, 0, 1, 0],  dtype=np.int32)
xmm2 = np.array([0, 1, 0, 0],  dtype=np.int32)
xmm3 = np.array([1, 0, 0, 0],  dtype=np.int32)
mod = np.repeat(np.array(0x0096433d, dtype=np.int32), 4)

def pack128(v):
    return v.tobytes()

while n < 0x112210f47de98115:
    xmm0b = xmm0
    xmm1b = xmm1
    xmm2b = xmm2
    xmm3b = xmm3

    xmm3 = (4 * xmm0b + 3 * xmm1b + 2 * xmm2b + 1 * xmm3b) % mod
    xmm2 = (8 * xmm0b + 7 * xmm1b + 6 * xmm2b + 5 * xmm3b) % mod
    xmm1 = (12 * xmm0b + 11 * xmm1b + 10 * xmm2b + 9 * xmm3b) % mod
    xmm0 = (16 * xmm0b + 15 * xmm1b + 14 * xmm2b + 13 * xmm3b) % mod

    n = n + 1

xmm_enc = pack128(xmm3)
xmm_enc += pack128(xmm2)
xmm_enc += pack128(xmm1)
xmm_enc += pack128(xmm0)

flag_o = "AOTW{"

for idx in xrange(0, 32, 2):
    flag_o += chr(ord(flag[idx + 0]) ^ ord(xmm_enc[(idx * 2) + 0]))
    flag_o += chr(ord(flag[idx + 1]) ^ ord(xmm_enc[(idx * 2) + 1]))

flag_o += "}"

print(flag_o)

Solution

Knowing it's a matrix exponentiation operation modulo a certain integer, we can speed the process up with modular exponentiation. Here is the solution in sagemath:

#!/usr/bin/env python
# coding: utf-8

# In[1]:
matrix = matrix(Integers(0x0096433d), 4, 4, range(1, 17))


# Output from known working impl (at iters = 0x100000):
# ```
# [5523814 9753103 4134779 8364068]
# [8834623 3475204 7963398 2603979]
# [2297819 7044918 1944404 6691503]
# [5608628  767019 5773023  931414]
# ```

# In[7]:
matrix ** (0x100000)


# In[8]:
flag = "fc14eb09bcaee7474fe37cc152a5028e8971c88d9623016d71405aeafd461d23".decode('hex')
xmm3, xmm2, xmm1, xmm0 = (matrix ** 0x112210f47de98115)


# In[14]:
import numpy as np
xmm_enc = np.array(xmm3, dtype=np.int32).tobytes()
xmm_enc += np.array(xmm2, dtype=np.int32).tobytes()
xmm_enc += np.array(xmm1, dtype=np.int32).tobytes()
xmm_enc += np.array(xmm0, dtype=np.int32).tobytes()

flag_o = "AOTW{"
for idx in xrange(0, 32, 2):
    flag_o += chr(ord(flag[idx + 0]) ^^ ord(xmm_enc[(idx * 2) + 0]))
    flag_o += chr(ord(flag[idx + 1]) ^^ ord(xmm_enc[(idx * 2) + 1]))

flag_o += "}"
flag_o

Flag

AOTW{M4tr1x_3xp0n3nti4t1on_5728391723}